“Web scraping is the process of automatically mining data or collecting information from the World Wide Web” – Wikipedia
Web scraping is a flexible method to extract numerical or textual data from the internet
There are many uses for web scraping, including:
You can easily check whether a site allows scraping with the robotstxt package
Netflix does not allow you to scrape their site
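A minimal sketch of such a check, assuming the robotstxt package is installed. The rules below are a made-up robots.txt supplied as text, so nothing is downloaded; the paths are hypothetical too.

```r
library(robotstxt)

# Hypothetical robots.txt content (not taken from any real site)
rules <- "User-agent: *\nDisallow: /private/\n"

# Build a robotstxt object from the text instead of downloading a file
rt <- robotstxt(text = rules)

# Check each path for the generic bot "*"
paths <- c("/", "/private/data")
allowed <- vapply(paths, function(p) rt$check(paths = p, bot = "*"), logical(1))
allowed
```

For a live site, paths_allowed() from the same package downloads the site's robots.txt and runs the same kind of check.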
Hyper Text Markup Language
“HTML is the standard markup language for creating Web pages”
Cascading Style Sheets
“CSS describes how HTML elements are to be displayed on screen, paper, or in other media”
– W3Schools
Image credit: Professor Shawn Santo
HTML is structured with “tags” indicating portions of a page
Tags can be called by their structure
Tags can be nested
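A small illustration of nested tags, using a made-up page rather than a real site. The list items sit inside a list, which sits inside a division, and the selector follows that nesting.

```r
library(rvest)

# Hypothetical page: <li> nested in <ul>, nested in <div>
page <- read_html('
  <div>
    <h1>R books</h1>
    <ul>
      <li>Advanced R</li>
      <li>R Graphics Cookbook</li>
    </ul>
  </div>
')

# "ul li" selects every <li> that sits inside a <ul>
items <- page %>%
  html_nodes("ul li") %>%
  html_text()
items
```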
A few important tags (of many) for scraping:
<h1> header tags </h1>
<p> paragraph elements </p>
<ul> unordered bulleted list </ul>
<ol> ordered list </ol>
<li> individual list item </li>
<div> division </div>
<table> table </table>
Extracting parts of a website can be daunting if you are unfamiliar with CSS
SelectorGadget is helpful (Chrome only)
Inspecting the page elements is also helpful
CSS - syntax is easier & aligns with HTML tags
XPath - useful when the node isn’t uniquely identified with CSS
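A rough comparison of the two selector types on a made-up snippet. The class names and book entries are hypothetical; html_nodes() accepts either a css or an xpath argument.

```r
library(rvest)

# Hypothetical markup: the title has a class, the format does not
page <- read_html('
  <div class="book"><span class="title">Advanced R</span><span>Paperback</span></div>
  <div class="book"><span class="title">The Book of R</span><span>Kindle</span></div>
')

# CSS: select by tag and class, mirroring how the HTML is written
css_titles <- page %>%
  html_nodes(css = "span.title") %>%
  html_text()

# XPath: select the first <span> that follows each title
xpath_formats <- page %>%
  html_nodes(xpath = "//span[@class='title']/following-sibling::span[1]") %>%
  html_text()

css_titles
xpath_formats
```

XPath can describe a node by its position relative to another node, which helps when the target carries no class or id of its own.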
That’s it!
Seems appropriate to pull R book data from Amazon
We are good to scrape!
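library(rvest)   # read_html(), html_nodes(), html_text(), %>%
library(stringr) # str_trim()
library(tibble)  # tibble()
amazon <- read_html("https://www.amazon.com/s?k=R&i=stripbooks&rh=n%3A283155%2Cn%3A75%2Cn%3A13983&dc&qid=1592086532&rnid=1000&ref=sr_nr_n_1")
Data as of 2020-07-07
amazon %>%
  html_nodes(".s-line-clamp-2") %>%
  html_text() -> titles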
head(titles)
#> [1] "\n \n \n \n\n\n\n\n\n \n \n \n R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n \n \n \n \n\n\n \n"
#> [2] "\n \n \n \n\n\n\n\n\n \n \n \n The Book of R: A First Course in Programming and Statistics\n \n \n \n \n\n\n \n"
#> [3] "\n \n \n \n\n\n\n\n\n \n \n \n Discovering Statistics Using R\n \n \n \n \n\n\n \n"
#> [4] "\n \n \n \n\n\n\n\n\n \n \n \n R Graphics Cookbook: Practical Recipes for Visualizing Data\n \n \n \n \n\n\n \n"
#> [5] "\n \n \n \n\n\n\n\n\n \n \n \n Advanced R, Second Edition (Chapman & Hall/CRC The R Series)\n \n \n \n \n\n\n \n"
#> [6] "\n \n \n \n\n\n\n\n\n \n \n \n Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)\n \n \n \n \n\n\n \n"
Remove \n & white space from the titles
titles <- str_trim(titles) # Removes leading & trailing space
head(titles)
#> [1] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"
#> [2] "The Book of R: A First Course in Programming and Statistics"
#> [3] "Discovering Statistics Using R"
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"
#> [5] "Advanced R, Second Edition (Chapman & Hall/CRC The R Series)"
#> [6] "Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)"
amazon %>%
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>%
  html_text() -> format
head(format)
#> [1] "\n \n \n \n Paperback\n \n \n"
#> [2] "\n \n \n \n Kindle\n \n \n"
#> [3] "\n \n \n \n Paperback\n \n \n"
#> [4] "\n \n \n \n eTextbook\n \n \n"
#> [5] "\n \n \n \n Paperback\n \n \n"
#> [6] "\n \n \n \n Kindle\n \n \n"
amazon %>%
  html_nodes("div.a-row.a-size-small") %>%
  html_text() -> rate_n
head(rate_n)
#> [1] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 427\n \n \n \n \n\n\n\n"
#> [2] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 76\n \n \n \n \n\n\n\n"
#> [3] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.5 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 255\n \n \n \n \n\n\n\n"
#> [4] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14\n \n \n \n \n\n\n\n"
#> [5] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.8 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 31\n \n \n \n \n\n\n\n"
#> [6] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.4 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14\n \n \n \n \n\n\n\n"
rate_n <- str_trim(rate_n)
head(rate_n)
#> [1] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 427"
#> [2] "4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 76"
#> [3] "4.5 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 255"
#> [4] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14"
#> [5] "4.8 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 31"
#> [6] "4.4 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14"
Let’s assemble the file!
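Before assembling, note that each trimmed rate_n string still holds two values: the star rating and the number of ratings. The step that separates them is not shown here, so the following is only one possible approach, using two strings copied from the output above with the inner whitespace shortened.

```r
library(stringr)

# Two example strings in the same shape as the trimmed scrape output
rate_n_raw <- c("4.7 out of 5 stars\n \n \n 427",
                "4.3 out of 5 stars\n \n \n 76")

# The rating is the number at the very start of the string
rating <- as.numeric(str_extract(rate_n_raw, "^[0-9.]+"))

# The count is the run of digits at the very end;
# commas are allowed for and stripped in case of counts such as 1,234
count_text <- str_extract(rate_n_raw, "[0-9,]+$")
rate_n <- as.numeric(str_remove_all(count_text, ","))

rating
rate_n
```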
length(titles)
#> [1] 16
length(format)
#> [1] 36
length(price)
#> [1] 36
length(rating)
#> [1] 14
length(rate_n)
#> [1] 14
length(pub_dt)
#> [1] 16
Wait! What?!?
Sometimes you get an uneven number of records in the scrape
We can fix this!
…manually…
Titles scraped accurately, but have multiple formats
Some books have 3 formats
Nothing needed here!
Or here!
Books missing ratings
Multiple formats - repeat ratings
Books with 3 formats
Books missing ratings & rating counts
Multiple formats - repeat rating counts
Books with 3 formats
Multiple formats - repeat publication dates
Books with 3 formats
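The manual fixes listed above can be sketched on a toy example. All values below are made up, and the per-title format counts are assumed to have been read off the page by hand; the real vectors would need the same treatment at full size.

```r
# Hypothetical miniature of the problem: 3 titles but 7 format rows
titles_demo <- c("Book A", "Book B", "Book C")
format_demo <- c("Paperback", "Kindle",
                 "Paperback", "eTextbook", "Hardcover",
                 "Paperback", "Kindle")

# Number of formats per title, counted manually (assumption)
n_formats <- c(2, 3, 2)

# Multiple formats: repeat each title once per format
titles_long <- rep(titles_demo, times = n_formats)

# Missing ratings: Book B was not rated, so insert NA at its position
rating_demo <- c(4.7, 4.5)
rating_full <- append(rating_demo, NA, after = 1)

# Then repeat the ratings per format, as with the titles
rating_long <- rep(rating_full, times = n_formats)

length(titles_long) == length(format_demo)
```

Once every vector has one entry per format row, the vectors can go into a single table.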
r_books <- tibble(title = titles,
                  text_format = format,
                  price = price,
                  rating = rating,
                  num_ratings = rate_n,
                  publication_date = pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>
#> 1 R for Data Science: Imp~ Paperback    40.1    4.7         427 2017-01-10
#> 2 R for Data Science: Imp~ Kindle       25.0    4.7         427 2017-01-10
#> 3 The Book of R: A First ~ Paperback    33.0    4.3          76 2016-07-16
#> 4 The Book of R: A First ~ eTextbook    30.0    4.3          76 2016-07-16
#> 5 Discovering Statistics ~ Paperback    34.4    4.5         255 2012-04-05
#> 6 Discovering Statistics ~ Kindle       61.6    4.5         255 2012-04-05
Web Scraping in R & rvest on GitHub
This talk is freely distributed under the MIT License.